Notify lookup sync of gossip processing results #5722
Conversation
Force-pushed from 90c30b3 to e6c8955
Squashed commit of the following:

commit e6c8955
Author: dapplion <[email protected]>
Date: Mon May 6 22:52:48 2024 +0900

    Notify lookup sync of gossip processing results
I was thinking about this more. In order to avoid sync having to know anything about block processing results (explicitly), can't we do the following:
Eventually the block will either fail or be processed. We will eventually get more requests for the chain of blocks, and eventually it will succeed, because the block will either be failed by the beacon processor or it will be successful. The issue is just around restarting the chains. To prevent loops, we could wait a slot or something, as sketched below. Just thinking this would be a way to avoid explicit communication from the beacon processor.
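A minimal sketch of this retry-with-delay alternative, assuming a 12-second slot; ChainLookup, schedule_retry, and the constant are illustrative names, not Lighthouse's actual API:

```rust
use std::time::{Duration, Instant};

// Illustrative slot duration; mainnet uses 12 seconds.
const SLOT: Duration = Duration::from_secs(12);

struct ChainLookup {
    next_retry: Instant,
}

impl ChainLookup {
    // Park the chain for a slot instead of dropping it, so the beacon
    // processor has time to finish (or fail) the block without sync
    // needing an explicit notification.
    fn schedule_retry(&mut self) {
        self.next_retry = Instant::now() + SLOT;
    }

    // Poll this before re-requesting the chain; prevents tight retry loops.
    fn should_retry(&self) -> bool {
        Instant::now() >= self.next_retry
    }
}
```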
For child lookups it's a bit tricky. Assume you are a bit behind, such that given this chain of blocks 6 <- 7 <- 8 <- 9 <- 10:
You have to create a lookup for slot 10, fetch the block, attempt to process it, get unknown parent, fetch 9, etc. until 6. If 6 is processing, lookup 7 needs some event to make progress. So the beacon processor must tell sync that block 6 has been processed. Otherwise, dropping all lookups that are children of 6 while 6 is processing sounds silly. You must leave child lookups of 6 paused somehow, awaiting their ancestor to process. Sync lookup technically doesn't need to know about the processing result; it would be enough to have a signal to continue children of 6 (sketched below). But then we have to add a new SyncMessage and have a dedicated handler in block lookups. While it reduces how much information the processor leaks, I don't see how it simplifies things. We already have a BlockComponentProcessed SyncMessage.
Note: the moment that sync lookup checks the chain's processing cache or da_checker, we break the separation of concerns between sync lookup and gossip processing. If we are crossing that line, let's implement a proper, informationally complete fix. Notice that these ignores will happen thousands of times, one per attestation, unless we have something like #5706.
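To make the trade-off concrete, here is a sketch of the pause-and-resume mechanism described above; every name (LookupState, on_parent_processed, etc.) is hypothetical, not Lighthouse's actual lookup code:

```rust
type Hash256 = [u8; 32];

enum LookupState {
    Downloading,
    // Parked: the parent (block 6 in the example) is still processing.
    AwaitingParentProcessing { parent_root: Hash256 },
    ReadyToProcess,
}

struct Lookup {
    state: LookupState,
}

// Called when the beacon processor reports a result for parent_root.
// Without an event like this, a lookup parked on block 6 never progresses.
fn on_parent_processed(lookups: &mut [Lookup], parent_root: Hash256, success: bool) {
    for lookup in lookups.iter_mut() {
        let parked_on_parent = matches!(
            lookup.state,
            LookupState::AwaitingParentProcessing { parent_root: p } if p == parent_root
        );
        if parked_on_parent {
            lookup.state = if success {
                // Parent is imported: the child can now be processed.
                LookupState::ReadyToProcess
            } else {
                // Parent failed: retry the download rather than dropping the chain.
                LookupState::Downloading
            };
        }
    }
}
```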
@@ -170,21 +172,20 @@ impl<T: BeaconChainTypes> NetworkBeaconProcessor<T> {
    if reprocess_tx.try_send(reprocess_msg).is_err() {
        error!(self.log, "Failed to inform block import"; "source" => "rpc", "block_root" => %hash)
    };
if matches!(process_type, BlockProcessType::SingleBlock { .. }) {
Why are we removing this match?
Is it redundant because we now only process single blocks here?
Yes, blobs are processed in process_rpc_blobs
beacon_node/network/src/network_beacon_processor/sync_methods.rs (outdated review thread, resolved)
@mergify queue
🛑 The pull request has been removed from the queue
@mergify refresh
✅ Pull request refreshed
@mergify queue
✅ The pull request has been merged automatically at 93e0649
General problem
Lookup sync and gossip interact in very racy ways. They interact at all because we want to prevent downloading things via ReqResp that we are already processing. Note that this optimization carries complexity, so we may want to decide whether it's worth it at all.
Consider the image above: we can receive an attestation referencing a block not yet imported into fork-choice at these moments in time (see the classification sketch after this list):
- A, before processing: easy, download and process the block.
- B, during processing: the block is already processing, so we should skip the download. However, we must consider the pitfalls below: processing may fail, and child lookups need to learn when processing completes.
- C, while waiting for blobs: the block is fully processed but not yet in fork-choice. In this case we don't have to worry about processing failures, but the child lookup issue remains.
- D, after import: this trigger should never happen. If it does, we will download all block components and get a BlockAreadyKnown error.
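For concreteness, a sketch of the classification these four moments imply, using the cache names from this PR's description; the closure parameters are simplified placeholders, not Lighthouse's actual API:

```rust
type Hash256 = [u8; 32];

enum Trigger {
    BeforeProcessing, // A: download and process the block
    DuringProcessing, // B: skip download, but we must hear the processing result
    AwaitingBlobs,    // C: block is valid, only blobs are missing
    AfterImport,      // D: should never happen; would end in BlockAreadyKnown
}

fn classify(
    block_root: Hash256,
    fork_choice_contains: impl Fn(Hash256) -> bool,
    da_checker_has_block: impl Fn(Hash256) -> bool,
    pre_import_cache_contains: impl Fn(Hash256) -> bool,
) -> Trigger {
    if fork_choice_contains(block_root) {
        Trigger::AfterImport
    } else if da_checker_has_block(block_root) {
        Trigger::AwaitingBlobs
    } else if pre_import_cache_contains(block_root) {
        Trigger::DuringProcessing
    } else {
        Trigger::BeforeProcessing
    }
}
```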
Issues with current unstable

da9d386
^ introduces the optimization to prevent downloads for triggers B and C. However, it fails to explicitly handle both pitfalls listed above: if processing fails, a lookup may get stuck. It also creates a spammy situation where a lookup gets dropped immediately, once per attestation, while the block is still processing.

^ Fixed the spammy loop, but it makes the case of processing failure worse: completed chains remain in a hashset for 60 seconds, so a failing lookup is blocked from being retried for that period.
Proposed Changes
Sync lookup must be aware of gossip processing results. I think there's no way around that.
The least invasive way to achieve that IMO is to notify sync with a BlockComponentProcessed event once gossip processing of a block completes (a sketch of the plumbing follows).
With the above we tackle both pitfalls listed above: a lookup whose block fails gossip processing can retry, and child lookups get an event to resume on.
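A sketch of that plumbing with simplified stand-ins for Lighthouse's types; the real BlockComponentProcessed message and its fields may differ:

```rust
use std::sync::mpsc::Sender;

type Hash256 = [u8; 32];

enum BlockProcessingResult {
    Imported,
    Err(String),
}

enum SyncMessage {
    // The event lookup sync handles; carries the gossip processing outcome.
    BlockComponentProcessed {
        block_root: Hash256,
        result: BlockProcessingResult,
    },
}

// Invoked by the network beacon processor after gossip processing of a
// block completes. Emitting on both success and failure is what unsticks
// lookups created for trigger B.
fn on_gossip_block_processed(
    sync_tx: &Sender<SyncMessage>,
    block_root: Hash256,
    result: BlockProcessingResult,
) {
    let _ = sync_tx.send(SyncMessage::BlockComponentProcessed { block_root, result });
}
```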
Pitfalls
This PR relies on these assumptions:
- If availability_checker.has_block(block) returns true, the block is permanently valid and is only missing blobs.
- If reqresp_pre_import_cache.contains_key(block) returns true, a BlockComponentProcessed event MUST be emitted some time in the future.
We don't have e2e tests to assert those conditions at the moment. I think code comments are good enough for now, but it would be great to codify them in the future somehow (see the sketch below).
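One possible way to codify those assumptions in-process, as a sketch; InvariantCtx and its fields are hypothetical, and today the invariants live only in code comments:

```rust
type Hash256 = [u8; 32];

// Snapshot of the state relevant to the two assumptions above.
struct InvariantCtx {
    da_checker_has_block: bool,
    block_is_valid: bool,
    only_blobs_missing: bool,
    pre_import_cache_hit: bool,
    processed_event_scheduled: bool,
}

fn assert_lookup_invariants(ctx: &InvariantCtx, block_root: Hash256) {
    // Assumption 1: availability_checker.has_block(block) == true implies
    // the block is permanently valid and only blobs are missing.
    if ctx.da_checker_has_block {
        debug_assert!(
            ctx.block_is_valid && ctx.only_blobs_missing,
            "da_checker invariant violated for {:?}",
            block_root
        );
    }
    // Assumption 2: a reqresp_pre_import_cache hit implies a
    // BlockComponentProcessed event will eventually be emitted.
    if ctx.pre_import_cache_hit {
        debug_assert!(
            ctx.processed_event_scheduled,
            "no BlockComponentProcessed scheduled for {:?}",
            block_root
        );
    }
}
```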
Todo
Closes #5693